libxl: event: Fix hang when mixing blocking and eventy calls
If the application calls libxl with ao_how==0 and also makes calls
like _occurred, libxl will sometimes get stuck.
The bug happens as follows (for example):
Thread A
libxl_do_thing(,ao_how==0)
libxl_do_thing starts, sets up some callbacks
libxl_do_thing exit path calls AO_INPROGRESS
libxl__ao_inprogress goes into event loop
eventloop_iteration sleeps on:
- do_thing's current fd set
- sigchld pipe if applicable
- its poller
Thread B
libxl_something_occurred
the something is to do with do_thing, above
do_thing_next_callback does some more work
do_thing_next_callback becomes interested in fd N
thread B returns to application
Note that nothing wakes up thread A. A is not listening on fd N. So
do_thing_* will not spot when fd N signals. do_thing will not make
further timely progress. If there is no timeout thread A will never
wake up.
The problem here occurs because thread A is waiting on an out of date
osevent set.
There is also the possibility that a thread might block waiting for
libxl osevents but outside libxl, eg if the application used
libxl_osevent_beforepoll. We will deal with that in a moment.
See the big comment in libxl_event.c for a fairly formal correctness
argument.
This depends on libxl__egc_ao_cleanup_1_baton being called everywhere
an egc or ao is disposed of. Firstly egcs: in this patch we rename
libxl__egc_cleanup, which means we catch all the disposal sites.
Secondly aos: these are disposed of by (i) AO_CREATE_FAIL
(ii) ao__inprogress and (iii) an event which completes the ao later.
(i) and (ii) we handle by adding the call to _baton. In the case of
(iii) any such function must be an event-generating function so it has
an egc too, so it will pass on the baton when the egc is disposed.
Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
---
v2: Call libxl__egc_ao_cleanup_1_baton (renamed from __egc_cleanup) on
all exits from ao_inprogress, even requests for async processing.
Fixes a remaining instance of this bug (!)
This involves disposing of ao->poller somewhat earlier.
v2: New correctness arguments in libxl_event.c comment and
in commit message.